
RFC: support external compiler passes #35015

Closed

Conversation
Conversation

@timholy commented Mar 5, 2020

The idea here is to allow packages to define custom optimization passes that start from type-inferred code. The proximal motivation was to enable LoopVectorization to get more information than a macro allows about the types of the objects it's working with. The overall design is that you insert :meta statements that optionally bracket the region of code you want the optimization applied to, but the callback function receives the entire method body (the Core.Compiler.OptimizationState).

Here, for example, would be the @avx macro:

macro avx(ex)
    esc(Expr(:block,
             Expr(:meta, Expr(:external_pass, :start, avx_pass)),
             ex,
             Expr(:meta, Expr(:external_pass, :stop))))
end

In the CodeInfo, this just brackets ex with

Expr(:meta, Expr(:external_pass, :start, avx_pass))
ex
Expr(:meta, Expr(:external_pass, :stop))

The compiler looks for such meta expressions and then hands the code to avx_pass, which is where all the magic needs to happen.
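
For concreteness, here is a minimal sketch (mine, not code from this PR) of what such a callback could look like; the OptimizationState field access (opt.src.code) and the exact marker layout are assumptions based on the description above:

# Hedged sketch only: the callback receives the OptimizationState, finds the
# region bracketed by the :external_pass markers, and rewrites it in place.
function avx_pass(opt)                       # opt::Core.Compiler.OptimizationState (assumed layout)
    code = opt.src.code                      # Vector{Any} of lowered/inferred statements
    ismarker(st, tag) = Meta.isexpr(st, :meta) &&
                        Meta.isexpr(st.args[1], :external_pass) &&
                        st.args[1].args[1] === tag
    istart = findfirst(st -> ismarker(st, :start), code)
    istop  = findfirst(st -> ismarker(st, :stop),  code)
    (istart === nothing || istop === nothing) && return opt
    # ... transform code[istart+1:istop-1] here ...
    return opt
end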

After the other passes, this does leave stray :external_pass meta expressions at the end of the CodeInfo. Not sure if I should remove those, but they seem likely to be harmless.

CC @chriselrod.

EDIT: Argh, just realized that I need to modify the iteration here: the pass is likely to change the number of lines, so this needs to start from scratch again after each pass. And the callback should be responsible for removing its :meta expression. But let's see what people think about the general idea before I fix this.

@timholy added the needs docs (Documentation for this change is required) and needs news (A NEWS entry is required for this change) labels on Mar 5, 2020
@vchuravy commented Mar 5, 2020

If I recall correctly, when Jarrett originally started working on Cassette, untyped IR was chosen because operating on typed IR requires maintaining more invariants, CodeInfo/IRCode are allowed to change at any time, and building things on top of them outside the compiler was intentionally discouraged. If you want to operate on untyped IR you could use a generated function, a Cassette pass, or an IRTools dynamo today, but getting at the type information and operating on typed IR seems to be the primary motivation here. #33955 is definitely related, but it follows a more Cassette-style approach where the impact of arbitrary code transformations is limited to code that is compiled for a context.
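
(For readers unfamiliar with the untyped-IR route mentioned above: an IRTools dynamo is roughly a generated function whose body is produced by transforming the callee's IR. Below is a minimal pass-through sketch following the pattern in the IRTools documentation; treat the exact API as an approximation rather than gospel.)

using IRTools

# A do-nothing dynamo: grab the (untyped) IR of the called function and return
# it unchanged. A real pass would rewrite `ir` before returning it.
IRTools.@dynamo function passthrough(a...)
    ir = IRTools.IR(a...)
    ir === nothing && return     # no IR available (e.g. intrinsics): fall back to the plain call
    IRTools.recurse!(ir)         # also route nested calls through the dynamo
    return ir
end

passthrough(sum, [1, 2, 3])      # == 6, but every call went through the transform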

Wanting to do this through a compiler hook is, it seems to me, actually a sign that this probably needs to be done within the compiler itself, but that places the onerous contract of being the compiler upon you.

@timholy commented Mar 5, 2020

I take your point about being "onerous," though with some extra contributions to TypedCodeUtils it might get less so.

My proximal motivation was supporting generic element types in LoopVectorization (JuliaSIMD/LoopVectorization.jl#65 (comment)), together with the observation that some bits of code might simplify if you knew the types. (Of course handling typed code is, as you say, more complicated.)

It does look like #33955 is related. I will have to check it out.

Maybe a good strategy is the following: I'm tired of #9080 (it seems to be down to a 10% penalty for big arrays, but for small arrays it's as high as 40%). What if I play with this and/or #33955 to see whether I can write a compiler pass that fixes it, move that into the actual compiler, and meanwhile gain experience with what this would be like in practice?

@chriselrod commented

My proximal motivation was supporting generic element types in LoopVectorization (chriselrod/LoopVectorization.jl#65 (comment)), together with the observation that some bits of code might simplify if you knew the types. (Of course handling typed code is, as you say, more complicated.)

I (or someone else) could file an issue or PR at LoopVectorization if we want to discuss or plan this further.

Originally, @avx was what is now @_avx (I'll fix the bug that inappropriately adds $(Expr(:meta,:inline))); it directly transformed the loop:

julia> @macroexpand @_avx for i ∈ eachindex(x), j ∈ eachindex(y)
           s += x[i] * A[i,j] * y[j]
       end
quote
    $(Expr(:meta, :inline))
    begin
        var"##loopeachindexi#270" = LoopVectorization.maybestaticrange(eachindex(x))
        var"##i_loop_lower_bound#271" = LoopVectorization.staticm1(first(var"##loopeachindexi#270"))
        var"##i_loop_upper_bound#272" = last(var"##loopeachindexi#270")
        var"##loopeachindexj#273" = LoopVectorization.maybestaticrange(eachindex(y))
        var"##j_loop_lower_bound#274" = LoopVectorization.staticm1(first(var"##loopeachindexj#273"))
        var"##j_loop_upper_bound#275" = last(var"##loopeachindexj#273")
        var"##vptr##_x" = LoopVectorization.stridedpointer(x)
        var"##vptr##_A" = LoopVectorization.stridedpointer(A)
        var"##vptr##_y" = LoopVectorization.stridedpointer(y)
        var"##T#269" = promote_type(eltype(x), eltype(A), eltype(y))
        var"##W#268" = LoopVectorization.pick_vector_width_val(eltype(x), eltype(A), eltype(y))
        var"##s_" = s
        var"##mask##" = LoopVectorization.masktable(var"##W#268", LoopVectorization.valrem(var"##W#268", var"##i_loop_upper_bound#272" - var"##i_loop_lower_bound#271"))
        var"##s_0" = LoopVectorization.vzero(var"##W#268", typeof(var"##s_"))
        var"##s_1" = LoopVectorization.vzero(var"##W#268", typeof(var"##s_"))
        var"##s_2" = LoopVectorization.vzero(var"##W#268", typeof(var"##s_"))
        var"##s_3" = LoopVectorization.vzero(var"##W#268", typeof(var"##s_"))
    end
    begin
        $(Expr(:gc_preserve, :(begin
            var"##outer##j##outer##" = LoopVectorization.unwrap(var"##j_loop_lower_bound#274")
            while var"##outer##j##outer##" < var"##j_loop_upper_bound#275" - 3
                i = LoopVectorization._MM(var"##W#268", var"##i_loop_lower_bound#271")
                j = var"##outer##j##outer##"
                var"####tempload#279_0_" = LoopVectorization.vload(var"##vptr##_y", (j,))
                j += 1
                var"####tempload#279_1_" = LoopVectorization.vload(var"##vptr##_y", (j,))
                j += 1
                var"####tempload#279_2_" = LoopVectorization.vload(var"##vptr##_y", (j,))
                j += 1
                var"####tempload#279_3_" = LoopVectorization.vload(var"##vptr##_y", (j,))
                begin
                    while i < var"##i_loop_upper_bound#272" - LoopVectorization.valmuladd(var"##W#268", 2, -1)
                        var"####tempload#276_0" = LoopVectorization.vload(var"##vptr##_x", (i,))
                        var"####tempload#276_1" = LoopVectorization.vload(var"##vptr##_x", (i + LoopVectorization.valmul(var"##W#268", 1),))
                        j = var"##outer##j##outer##"
                        var"####tempload#278_0_0" = LoopVectorization.vload(var"##vptr##_A", (i, j))
                        var"####tempload#278_0_1" = LoopVectorization.vload(var"##vptr##_A", (i + LoopVectorization.valmul(var"##W#268", 1), j))
                        var"####temporary#277_0_0" = LoopVectorization.vmul(var"####tempload#278_0_0", var"####tempload#279_0_")
                        var"####temporary#277_0_1" = LoopVectorization.vmul(var"####tempload#278_0_1", var"####tempload#279_0_")
                        var"##s_0" = LoopVectorization.vfmadd231(var"####tempload#276_0", var"####temporary#277_0_0", var"##s_0")
                        var"##s_1" = LoopVectorization.vfmadd231(var"####tempload#276_1", var"####temporary#277_0_1", var"##s_1")
                        j += 1
                        var"####tempload#278_1_0" = LoopVectorization.vload(var"##vptr##_A", (i, j))
                        var"####tempload#278_1_1" = LoopVectorization.vload(var"##vptr##_A", (i + LoopVectorization.valmul(var"##W#268", 1), j))
                        var"####temporary#277_1_0" = LoopVectorization.vmul(var"####tempload#278_1_0", var"####tempload#279_1_")
                        var"####temporary#277_1_1" = LoopVectorization.vmul(var"####tempload#278_1_1", var"####tempload#279_1_")
                        var"##s_2" = LoopVectorization.vfmadd231(var"####tempload#276_0", var"####temporary#277_1_0", var"##s_2")
                        var"##s_3" = LoopVectorization.vfmadd231(var"####tempload#276_1", var"####temporary#277_1_1", var"##s_3")
                        j += 1
                        var"####tempload#278_2_0" = LoopVectorization.vload(var"##vptr##_A", (i, j))
                        var"####tempload#278_2_1" = LoopVectorization.vload(var"##vptr##_A", (i + LoopVectorization.valmul(var"##W#268", 1), j))
                        var"####temporary#277_2_0" = LoopVectorization.vmul(var"####tempload#278_2_0", var"####tempload#279_2_")
                        var"####temporary#277_2_1" = LoopVectorization.vmul(var"####tempload#278_2_1", var"####tempload#279_2_")
                        var"##s_0" = LoopVectorization.vfmadd231(var"####tempload#276_0", var"####temporary#277_2_0", var"##s_0")
                        var"##s_1" = LoopVectorization.vfmadd231(var"####tempload#276_1", var"####temporary#277_2_1", var"##s_1")
                        j += 1
                        var"####tempload#278_3_0" = LoopVectorization.vload(var"##vptr##_A", (i, j))
                        var"####tempload#278_3_1" = LoopVectorization.vload(var"##vptr##_A", (i + LoopVectorization.valmul(var"##W#268", 1), j))
                        var"####temporary#277_3_0" = LoopVectorization.vmul(var"####tempload#278_3_0", var"####tempload#279_3_")
                        var"####temporary#277_3_1" = LoopVectorization.vmul(var"####tempload#278_3_1", var"####tempload#279_3_")
                        var"##s_2" = LoopVectorization.vfmadd231(var"####tempload#276_0", var"####temporary#277_3_0", var"##s_2")
                        var"##s_3" = LoopVectorization.vfmadd231(var"####tempload#276_1", var"####temporary#277_3_1", var"##s_3")
                        i += LoopVectorization.valmul(var"##W#268", 2)
                    end
...

Later, a new macro was added and given the old name (as it is recommended over the old one). This one also creates a LoopSet object, but then converts it into a type parameter so that it can punt code generation to a generated function:

julia> @macroexpand @avx for i ∈ eachindex(x), j ∈ eachindex(y)
           s += x[i] * A[i,j] * y[j]
       end
quote
    var"##loopeachindexi#282" = LoopVectorization.maybestaticrange(eachindex(x))
    var"##i_loop_lower_bound#283" = LoopVectorization.staticm1(first(var"##loopeachindexi#282"))
    var"##i_loop_upper_bound#284" = last(var"##loopeachindexi#282")
    var"##loopeachindexj#285" = LoopVectorization.maybestaticrange(eachindex(y))
    var"##j_loop_lower_bound#286" = LoopVectorization.staticm1(first(var"##loopeachindexj#285"))
    var"##j_loop_upper_bound#287" = last(var"##loopeachindexj#285")
    var"##vptr##_x" = LoopVectorization.stridedpointer(x)
    var"##vptr##_A" = LoopVectorization.stridedpointer(A)
    var"##vptr##_y" = LoopVectorization.stridedpointer(y)
    local var"##s_0"
    begin
        $(Expr(:gc_preserve, :(var"##s_0" = LoopVectorization._avx_!(Val{(0, 0)}(), Tuple{:LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x02), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x03), :LoopVectorization, :vmul, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000203, LoopVectorization.compute, 0x00, 0x04), :LoopVectorization, Symbol("##254"), LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000012, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x05), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000012, 0x0000000000000000, 0x0000000000010405, LoopVectorization.compute, 0x00, 0x05)}, Tuple{LoopVectorization.ArrayRefStruct(0x0000000000000001, 0x0000000000000001, 0x0000000000000030), LoopVectorization.ArrayRefStruct(0x0000000000000101, 0x0000000000000102, 0xffffffffffffb068), LoopVectorization.ArrayRefStruct(0x0000000000000001, 0x0000000000000002, 0xffffffffffffffb0)}, Tuple{0, Tuple{6}, Tuple{5}, Tuple{}, Tuple{}, Tuple{}, Tuple{}}, (var"##i_loop_lower_bound#283":var"##i_loop_upper_bound#284", var"##j_loop_lower_bound#286":var"##j_loop_upper_bound#287"), var"##vptr##_x", var"##vptr##_A", var"##vptr##_y", s)), :x, :A, :y))
    end
    s = LoopVectorization.reduced_add(var"##s_0", s)
end

The type parameter provides all the information needed to reconstruct the LoopSet. While reconstructing, it can use type information, e.g. to figure out whether an array was transposed or is a SubArray with non-unit first stride.
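
As a rough illustration of what "using type information" buys here (the names and definitions below are mine, not LoopVectorization's): the argument types alone reveal memory layout, so a generated function can pick a strategy without running any code.

using LinearAlgebra

# Illustrative only: decide whether the first stride is 1 purely from the type.
has_unit_first_stride(::Type{<:Array}) = true
has_unit_first_stride(::Type{<:Adjoint}) = false    # lazily transposed matrix: row-major view of a column-major parent
has_unit_first_stride(::Type{<:SubArray{T,N,P,I}}) where {T,N,P,I} =
    I <: Tuple{AbstractUnitRange,Vararg{Any}} && has_unit_first_stride(P)

A = rand(4, 4)
has_unit_first_stride(typeof(A)), has_unit_first_stride(typeof(A')), has_unit_first_stride(typeof(view(A, 2:3, :)))
# (true, false, true)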

Perhaps we could deprecate @_avx and focus on supporting this API. That would let us simplify the register_single_loop function, which would then simply need to ensure an iterable exists to pass to the generated function, ideally also carrying information in parametric types where possible (e.g., substitute integer literals like 4 with Static{4}(), and replace calls like size(A, 2) with maybestaticsize(A, Val{2}()) to give StaticArrays the option to add methods).
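
To make the "carry information in parametric types" idea concrete, here is a minimal sketch; Static and maybestaticsize are the names used above, but the definitions below are simplified stand-ins, not the package's actual implementation.

# Simplified stand-ins: a value known at parse time is lifted into the type
# domain so a generated function can see it without any runtime information.
struct Static{N} end
Static(N::Int) = Static{N}()

unwrap(::Static{N}) where {N} = N      # compile-time constant
unwrap(x::Integer) = x                 # runtime fallback

# Generic fallback; a package like StaticArrays could add a method returning
# Static{N}() when the size is encoded in the array type.
maybestaticsize(A, ::Val{dim}) where {dim} = size(A, dim)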

While I think there's a lot of room for further taking advantage of that, it would be great to also be able to do something like this:

julia> x = rand(ComplexF64); y = rand(ComplexF64);

julia> @code_typed x * y
CodeInfo(
1 ─ %1  = Base.getfield(z, :re)::Float64
│   %2  = Base.getfield(w, :re)::Float64
│   %3  = Base.mul_float(%1, %2)::Float64
│   %4  = Base.getfield(z, :im)::Float64
│   %5  = Base.getfield(w, :im)::Float64
│   %6  = Base.mul_float(%4, %5)::Float64
│   %7  = Base.sub_float(%3, %6)::Float64
│   %8  = Base.getfield(z, :re)::Float64
│   %9  = Base.getfield(w, :im)::Float64
│   %10 = Base.mul_float(%8, %9)::Float64
│   %11 = Base.getfield(z, :im)::Float64
│   %12 = Base.getfield(w, :re)::Float64
│   %13 = Base.mul_float(%11, %12)::Float64
│   %14 = Base.add_float(%10, %13)::Float64
│   %15 = %new(Complex{Float64}, %7, %14)::Complex{Float64}
└──       return %15
) => Complex{Float64}

which would allow supporting a much broader range of number types.

@tkf commented Mar 5, 2020

I'm tired of #9080 (it seems to be down to a 10% penalty for big arrays, but for small arrays it's as high as 40%).

This issue is way beyond my understanding, but regarding #9080, I think it should be pretty easy to make iteration over CartesianIndices as fast as a manually written nested loop by using the functional foreach/foldl/reduce approach (ref https://tkf.github.io/juliacon2019-transducers/#66, slides 66 to 71). This won't even require writing a @generated function. Did you look into a foreach-based approach?
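
For readers who haven't seen the pattern: the idea (my own minimal sketch, not tkf's code) is to recurse over the axes so that every innermost loop is a plain 1-D loop the compiler can specialize, rather than iterating a single CartesianIndices object.

# Minimal sketch of the recursive foreach pattern over an N-dimensional array.
function nested_foreach(f, ax::Tuple)
    if length(ax) == 1
        for i in ax[1]
            f((i,))
        end
    else
        for j in ax[end]                                  # outermost dimension
            nested_foreach(I -> f((I..., j)), Base.front(ax))
        end
    end
end

# Usage: accumulate over a matrix; each inner loop is a simple unit-stride loop.
A = rand(3, 4)
s = Ref(0.0)
nested_foreach(I -> (s[] += A[I...]), axes(A))
s[] ≈ sum(A)    # true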

@tkf commented Mar 6, 2020

I just realized my comment is rather off-topic here. I re-posted a similar comment in #9080 (comment) with an MWE.

@timholy commented Mar 6, 2020

@chriselrod, thanks for the detailed explanation! Very informative. I see the system you've built is more flexible than I credited.

I think one of my motivations here is to provide a smooth path toward this kind of transformation happening automatically, without requiring an @avx annotation. But I'm not certain that's a great idea, because some of the loop transformations are likely to be expensive in terms of compile time, and it's useful to have a way of marking a block as worth extra effort in compilation.

Relatedly, one possible advantage of writing this as a compiler pass rather than a generated function is reduced compile times. I may not be thinking about this properly, but the idea is that if you write it as a pass, then you only have to compile the pass itself once, and it can then transform any number of functions. Conversely, if you do this via a generated function, then each instance of @avx requires expansion of the generated function (surely different for each use of @avx) and a full inference pass on a more complicated function.

If you try an experiment you'll see there's some (but not fully convincing) evidence for this viewpoint: if you redefine the functions in the tests here and then time the first execution (i.e., including the compile time), on my machine the ones with @avx annotations are 3-4x slower than old2d!. In some ways, 3-4x is a lot but it's not as big as I expected, especially when you account for the fact that even in my scheme of moving this into the compiler there will be some (unpredictable as yet) cost to making those transformations.

Bottom line, now that I understand more about your approach I'm more sanguine about continuing to help advance it, though I still suspect that moving more of this into the compiler is the better long-term solution.

@timholy commented Mar 11, 2020

So, having used this for #35074, my feeling is: we want this or something like it (maybe #33955). BUT, if folks are concerned about it opening a back door to a bunch of poorly-written, crashy passes, I'd be just as happy deleting the tests and leaving this code in place, but commented out (and with a link to a gist or demo that shows how to use it). That way people who try new compiler passes can just build julia from source and uncomment the code to make it easier (much, much easier) to develop their pass. Then they can contribute their pass to Core.Compiler.

# External passes
const _opt = Ref{Any}(nothing)
avx_pass(opt) = (_opt[] = opt; opt)
macro avx(ex)
A Contributor commented on the diff above:
Because this doesn't actually have anything to do with AVX, maybe it would be clearer to call these demo_pass and @with_demo_pass?

@timholy (author) replied:
If we go with the "comment out" option this will be deleted anyway. But yes, for the demo we need better names.

@vchuravy replied:

BUT, if folks are concerned about it opening a back door to a bunch of poorly-written, crashy passes,

Yes! Pass ordering is challenging and I'd rather not make it more complex ;)

@oxinabox commented

So the pass function in this PR gets an optimization state (which includes basically everything from a @code_typed compiler pass and a little bit more) and mutates it, which lets it do anything.

One of the things I would like in general when writing Cassette passes, and that should be available by the stage at which this runs, is the control-flow graph and dom-tree.

However, making them available to external passes opens up the possibility that a pass will make them incorrect by updating the optimization state without also updating the control-flow graph / dom-tree.

@timholy commented Mar 11, 2020

We could put more such meta blocks in: :external_typed_pass, :external_ir_pass, :external_domtree_pass, and :external_inlined_pass? I'd be happy to change this to :external_typed_pass and let other folks add the others.

@tkf commented Apr 23, 2020

There are some compiler features that I'm interested in playing with. I wonder if they are implementable with a pass mechanism like this (or something similar).

Auto type stabilization

Suppose I have a function like this:

@autostabilize function f()
    var1 = expr1
    var2 = expr2  # cannot be inferred
    var3 = expr3
    var4 = expr4
end

Can I implement a compiler hint @autostabilize that transforms the above code into something equivalent to

function f()
    var1 = expr1
    var2 = expr2
    g(var1, var2)
end

function g(var1, var2)
    var3 = expr3
    var4 = expr4
end

?

Tail-calls to finite-state machines (TC-to-FSM)

As I discussed in "Tail-call optimization and function-barrier-based accumulation in loops" (Internals & Design, JuliaLang Discourse), I think optimizing tail calls has a grounded motivation as a natural extension of the commonly used mutate-or-widen technique. Is it possible to implement a compiler pass that transforms tail calls (which may be dispatched to different methods) into a finite-state machine, using this mechanism? It'd also be nice to inject a limit on the number of times the function barrier is used in a loop, as mentioned on Discourse, to avoid stack overflow.
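
To make the TC-to-FSM idea concrete, here is a hand-written sketch (names are illustrative, not from any package): two mutually tail-recursive "states", followed by the loop-over-a-state-tag form that such a compiler pass could in principle produce automatically.

# Tail-call style: each call is in tail position and may dispatch to a different "state".
even_state(n) = n == 0 ? true  : odd_state(n - 1)
odd_state(n)  = n == 0 ? false : even_state(n - 1)

# FSM style: a single loop that dispatches on a state value instead of calling,
# so the stack never grows no matter how many "transitions" occur.
function is_even_fsm(n)
    state = :even
    while true
        if state === :even
            n == 0 && return true
            state = :odd
        else
            n == 0 && return false
            state = :even
        end
        n -= 1
    end
end

is_even_fsm(10) == even_state(10)    # true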

(If I can make both @autostabilize and TC-to-FSM work, it's probably unnecessary to write tail calls manually, which is even better.)


Edit: Actually, I guess what I want to do doesn't require a typed IR. I guess it's already doable with something like IRTools.jl (or directly with the hacks it's using)?

Edit 2: Yep, there is already https://github.com/MikeInnes/IRTools.jl/blob/6d227c0edb828b7c761c97fe899dd33c03e69b56/examples/continuations.jl :)

@vtjnash closed this May 29, 2021

@vtjnash commented May 29, 2021

Now have AbstractInterpreter

@vtjnash deleted the teh/external_passes branch May 29, 2021 04:01